Abstract

Sparse distributed representation is the key to learning useful features in
deep learning algorithms, because it is not only an efficient mode of data
representation but also, more importantly, it captures the generation process
of most real-world data. While a number of regularized auto-encoders (AE)
enforce sparsity explicitly in their learned representation and others do not,
there has been little formal analysis of what encourages sparsity in these
models in general. Our objective is to formally study this general problem for
regularized auto-encoders. We provide sufficient conditions on both
regularization and activation functions that encourage sparsity. We show that
multiple popular models (e.g., de-noising and contractive auto-encoders) and
activations (e.g., rectified linear and sigmoid) satisfy these conditions;
thus, our conditions help explain sparsity in their learned representation.
Our theoretical and empirical analysis together shed light on the properties
of regularization/activation that are conducive to sparsity and unify a number
of existing auto-encoder models and activation functions under the same
analytical framework.


1. Introduction 

Sparse Distributed Representation (SDR) (Hinton, 1984) constitutes a
fundamental reason behind the success of deep learning. On one hand, it is an
efficient way of representing data that is robust to noise; in fact, some of
the main advantages of sparse distributed representation in the context of
deep neural networks have been shown to be information disentangling and
manifold flattening (Bengio et al., 2013), as well as better linear
separability and representational power (Glorot et al., 2011). On the other
hand, and more importantly, SDR captures the data generation process itself
and is biologically inspired (Hubel & Wiesel, 1959; Olshausen & Field, 1997;
Patterson et al., 2007), which makes this mode of representation useful in the
first place.

For these reasons, our objective in this paper is to investigate why a number
of regularized Auto-Encoders (AE) exhibit similar behaviour, especially in
terms of learning sparse representations. AEs are especially interesting for
this matter because of the clear distinction between their learned encoder
representation and decoder output. This is in contrast with other deep models
where there is no clear distinction between the encoder and decoder parts.
The idea of AEs learning sparse representations (SR) is not new. Due to the
aforementioned biological connection between SR and NNs, a natural follow-up
pursued by a number of researchers was to propose AE variants that encourage
sparsity in their learned representation (Lee et al., 2008; Kavukcuoglu &
Lecun, 2008; Ng, 2011). On the other hand, there has also been work on
empirically analyzing/suggesting the sparseness of hidden representations
learned after pre-training with unsupervised models (Memisevic et al., 2014;
Li et al., 2013; Nair & Hinton, 2010). However, to the best of our knowledge,
there has been no prior work formally analyzing why regularized AEs learn
sparse representations in general. The main challenge behind doing so is the
analysis of non-convex objective functions. In addition, questions regarding
the efficacy of activation functions and the choice of regularization on the
AE objective are often raised since there are multiple available choices for
both. We also try to address these questions with regard to SR in this paper.

We address these questions in two parts. First, we prove sufficient conditions
on AE regularizations that encourage low pre-activations in hidden units. We
then analyze the properties of activation functions that, when coupled with
such regularizations, result in sparse representation. Multiple popular
activations have these desirable properties. Second, we show that multiple
popular AE objectives, including de-noising auto-encoder (DAE) (Vincent et
al., 2008) and contractive auto-encoder (CAE) (Rifai et al., 2011b), indeed
have the suggested form of regularization, thus explaining why existing AEs
encourage sparsity in their latent representation. Based on our theoretical
analysis, we also empirically study multiple popular AE models and activation
functions in order to analyze their comparative behaviour in terms of sparsity
in the learned representations. Our analysis thus shows why various AE models
and activations lead to sparsity. As a result, they are unified under a
framework uncovering the fundamental properties of regularizations and
activation functions that most of these existing models possess.

2. Auto-Encoders and Sparse Representation 

Auto-Encoders (AE) (Rumelhart et al., 1986; Bourlard & Kamp, 1988) are a class
of single hidden layer neural networks trained in an unsupervised manner. An
AE consists of an encoder and a decoder. An input ($x \in \mathbb{R}^n$) is
first mapped to the latent space with $h = f_e(x) = s_e(Wx + b_e)$, where $h$
is the hidden representation vector, $s_e$ is the encoder activation,
$W \in \mathbb{R}^{m \times n}$ is the weight matrix, and
$b_e \in \mathbb{R}^m$ is the encoder bias. Then, it maps the hidden output
back to the original space by $y = f_d(h) = s_d(W^T h)$, where $y$ is the
reconstructed counterpart of $x$ and $s_d$ is the decoder activation. The
objective of a basic auto-encoder is to minimize the following with respect to
the parameters $\{W, b_e\}$

$$J_{AE} = \mathbb{E}_x\left[\ell(x, f_d(f_e(x)))\right] \tag{1}$$

where $\ell(\cdot)$ is the squared loss function. The motivation behind this
objective is to capture predominant repeating patterns in data. Thus, although
the auto-encoder optimization learns to map an input back to itself, the focus
is on learning a noise invariant representation (manifold) of data.
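
The definitions above can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes tied weights with linear decoding ($s_d$ is the identity, as used in the later analysis) and ReLU as the default encoder activation, and the helper names `encode`, `decode` and `ae_loss` are hypothetical rather than taken from the paper.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def encode(x, W, b_e, s_e=relu):
    """h = s_e(W x + b_e); x has shape (N, n), W has shape (m, n)."""
    return s_e(x @ W.T + b_e)

def decode(h, W):
    """Linear decoding with tied weights: y = W^T h (s_d assumed to be identity)."""
    return h @ W

def ae_loss(x, W, b_e, s_e=relu):
    """Sample estimate of J_AE = E_x[ ||x - f_d(f_e(x))||^2 ]."""
    y = decode(encode(x, W, b_e, s_e), W)
    return np.mean(np.sum((x - y) ** 2, axis=1))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))        # N = 5 samples, n = 4 inputs
W = 0.1 * rng.standard_normal((3, 4))  # m = 3 hidden units
b_e = np.zeros(3)
loss = ae_loss(x, W, b_e)
```

With all-zero weights and bias the reconstruction is zero, so the loss reduces to the mean squared norm of the inputs, which is a quick sanity check on the implementation.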

2.1. Part I: What encourages sparsity during 
Auto-Encoder training? 

2.1.1. Sparsity and our assumption 

Learning a dictionary adapted to a set of training data such that the latent
code is sparse is generally formulated as the following optimization problem
(Olshausen & Field, 1997)

$$\min_{W, h} \sum_{i=1}^{N} \left( \|x_i - W^T h_i\|^2 + \lambda \|h_i\|_1 \right) \tag{2}$$

The above objective is convex in each one of $W$ and $h$ when the other is
fixed, and hence it is generally solved alternately in each variable while
fixing the other. Note that the $\ell_1$ penalty is the driving force in the
above objective and forces the latent variable to be sparse.

This section analyses the factors that are required for sparsity in AEs. Note
that in (2) we optimize for a different parameter $h_i$ for each corresponding
sample. In the case of AEs, we do not have a separate parameter that denotes
the hidden representation corresponding to every sample individually. Instead,
the hidden representation for every sample is a function of the sample itself
along with other network parameters. So in order to define the notion of
sparsity of hidden representation in AEs, we will treat each hidden unit
$h_j = s_e(W_j x + b_{e_j})$ as a random variable which itself is a function
of the random variable $x$. Then the average activation fraction of a unit is
the (probability) mass of the (data) distribution for which the hidden unit
activates. For finite sample datasets, this becomes the fraction of data
samples for which the unit activates.

Also note that SDR dictates that all representational units participate in
data representation while very few units activate for a single data sample.
Thus a major difference between SDR and SR is that of dead units (units that
do not activate for any data sample), since sparsity can in general also be
achieved when most units are dead. However, the latter scenario is undesirable
because it does not truly capture SDR. Thus we model and study the conditions
that encourage sparsity in hidden units, and we also empirically show that
these conditions are capable of achieving SDR.

For our analysis, we will use linear decoding, which addresses the case of
continuous real valued data distributions. We will now show that both
regularization and activation function play an important role in achieving
sparsity. In order to do so, we make the following assumption.

Assumption 1. We assume that the data $x$ is drawn from a distribution
$x \sim \mathcal{X}$ for which $\mathbb{E}_x[x] = 0$ and
$\mathbb{E}_x[xx^T] = I$, where $I$ is the identity matrix.

Further, let $r_x^t = x - W^T f_e(x)$ denote the reconstruction residual
during auto-encoder training at any iteration $t$ for training sample $x$.
Then we assume every dimension of $r_x^t$ is an i.i.d. random variable
following a Gaussian distribution with mean 0 and standard deviation
$\sigma_r$.
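
For finite data, the first part of Assumption 1 can be enforced on the sample statistics by whitening. The following is a minimal sketch using PCA whitening; this particular preprocessing is an assumption made here for illustration (the experiments in Section 3 only standardize the data), and `whiten` is a hypothetical helper name.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Center X (N, n) and rotate/scale it so the sample covariance is I."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # PCA whitening: project on eigenvectors, divide by sqrt(eigenvalues).
    return Xc @ eigvecs / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
A = np.eye(5) + 0.5 * np.ones((5, 5))            # illustrative mixing matrix
X = rng.standard_normal((1000, 5)) @ A + 3.0     # correlated, non-zero-mean data
Xw = whiten(X)                                   # sample mean 0, sample covariance I
```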

Before proceeding, we first establish an important condition needed by AEs for
exhibiting sparse behaviour. Consider the pre-activation of an AE

$$a_j^t = W_j^t x + b_{e_j}^t \tag{3}$$

Here $j$ and $t$ denote the $j$th hidden unit and $t$th training iteration
respectively, and $W_j^t$ denotes the $j$th row of $W^t$. Then notice that
when Assumption 1 is true, if we remove the encoding bias from the AE
optimization, the expected pre-activation becomes
$\mathbb{E}_x[a_j^t] = \mathbb{E}_x[W_j^t x] = 0$ unconditionally for all
iterations. Consider any activation function $s_e(.)$ with activation
threshold $\delta_{min}$, i.e., any data sample with $j$th pre-activation
$a_j^t$ would de-activate the unit if $a_j^t \leq \delta_{min}$ and activate
it otherwise. Then the only way for a unit to exhibit sparse behaviour (over a
data distribution) when the expected pre-activation is always zero is for the
majority of the samples to have pre-activation below $\delta_{min}$. Then, in
order for the average to be zero, the minority above the threshold must take
values of larger magnitude on average than the majority. However, this
strategy limits the degree of sparsity that a unit can achieve for any given
data distribution following Assumption 1 when the weight lengths are upper
bounded, because the pre-activation values also become upper bounded. The
bounded weight length condition is desired in practice for convergence and is
achieved by regularizations like weight decay and Max-Norm (Hinton et al.,
2012). Thus, in order for hidden units to exhibit sparse behaviour, the
encoding bias needs to be a part of the AE optimization.
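
The role of the encoding bias can be illustrated numerically. The sketch below assumes standard Gaussian data (one distribution that satisfies Assumption 1), a unit-length weight vector, a threshold of 0, and an arbitrary illustrative bias of -1.5; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100_000, 20))    # E[x] = 0, E[x x^T] = I (Gaussian assumed)
w = rng.standard_normal(20)
w /= np.linalg.norm(w)                    # unit-length weight vector

a_no_bias = x @ w                         # pre-activation without bias
a_bias = x @ w - 1.5                      # same unit with an illustrative bias

mean_no_bias = a_no_bias.mean()           # close to 0 regardless of w
frac_off_no_bias = np.mean(a_no_bias <= 0.0)   # fraction of de-activated samples
frac_off_bias = np.mean(a_bias <= 0.0)
```

For this symmetric distribution the bias-free unit is de-activated for roughly half of the samples whatever direction `w` takes, whereas a negative bias moves most of the probability mass below the threshold.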

Having established the importance of the encoding bias, we make the following
deduction based on the above assumption.

Lemma 1. If Assumption 1 is true, and the encoding activation function
$s_e(.)$ has first derivative in $[0, 1]$, then
$\frac{\partial J_{AE}}{\partial b_{e_j}} \in \left[-2\sigma_r\sqrt{n}\,\|W_j\|,\; 2\sigma_r\sqrt{n}\,\|W_j\|\right]$.
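
One way to see where a bound of this form can come from is sketched below for squared loss and linear decoding. This is an illustrative argument under Assumption 1 (using $s_e' \in [0,1]$, the Cauchy-Schwarz inequality and Jensen's inequality), not necessarily the proof used by the authors; $r_x$ denotes the residual $x - W^T h$ and $n$ the input dimension.

```latex
\begin{aligned}
\frac{\partial J_{AE}}{\partial b_{e_j}}
  &= \mathbb{E}_x\!\left[\frac{\partial}{\partial b_{e_j}} \|x - W^T h\|^2\right]
   = -2\,\mathbb{E}_x\!\left[\frac{\partial h_j}{\partial a_j}\, W_j r_x\right], \\
\left|\frac{\partial J_{AE}}{\partial b_{e_j}}\right|
  &\le 2\,\mathbb{E}_x\!\left[\,|W_j r_x|\,\right]
   \le 2\,\|W_j\|\,\mathbb{E}_x\!\left[\|r_x\|\right]
   \le 2\,\|W_j\|\sqrt{\mathbb{E}_x\!\left[\|r_x\|^2\right]}
   = 2\,\sigma_r \sqrt{n}\,\|W_j\|.
\end{aligned}
```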

Using the above result, the theorem below gives a sufficient condition on
regularization functions needed for forcing the average pre-activation value
($\mathbb{E}[a_j^t]$) to keep on reducing after every training iteration.

Theorem 1. Let $\{W^t \in \mathbb{R}^{m \times n}, b_e^t \in \mathbb{R}^m\}$
be the parameters of a regularized auto-encoder ($\lambda > 0$)

$$J_{RAE} = J_{AE} + \lambda \mathcal{R}(W, b_e) \tag{4}$$

at training iteration $t$ with regularization term $\mathcal{R}(W, b_e)$ and
activation function $s_e(.)$, and define the pre-activation
$a_j^t = W_j^t x + b_{e_j}^t$ (thus $h_j^t = s_e(a_j^t)$). If
$\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} > 2\sigma_r\sqrt{n}\,\|W_j^t\|$,
where $j \in \{1, 2, \ldots, m\}$, then updating $\{W^t, b_e^t\}$ along the
negative gradient of $J_{RAE}$ results in
$\mathbb{E}_x[a_j^{t+1}] < \mathbb{E}_x[a_j^t]$ and
$\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \geq 0$.

Interpretation: The important thing to notice in the above theorem is that
larger values of $\lambda$ are expected to lead to lower expected
pre-activation values since

$$\mathbb{E}_x[a_j^{t+1}] = \mathbb{E}_x[a_j^t] - \eta\left(\frac{\partial J_{AE}}{\partial b_{e_j}} + \lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}}\right) \tag{5}$$

where $\eta$ is the learning rate. But this may not be true in general over
multiple iterations due to terms in $\frac{\partial J_{AE}}{\partial b_{e_j}}$
that depend on weight vectors that also change every iteration depending on
the value of $\lambda$. However, we are generally interested in the direction
of the weight vectors during reconstruction instead of their scale. Thus if we
fix the length of weight vectors (to, say, unit length), then the term
$\frac{\partial J_{AE}}{\partial b_{e_j}}$ will be bounded by a fixed term
w.r.t. weight vectors and will only depend on the bias and data distribution.
Under these circumstances, increasing the value of $\lambda$ is conducive to
lower expected pre-activation if $\frac{\partial \mathcal{R}}{\partial b_{e_j}}$
is strictly greater than zero. On the other hand, if
$\frac{\partial \mathcal{R}}{\partial b_{e_j}} = 0$, then changing the value
of $\lambda$ should not have a significant effect on expected pre-activation
values, especially when the weight length is fixed. In the case when the
weight length is not fixed, changing the value of $\lambda$ will affect the
weight length, which in turn will affect the term
$\frac{\partial J_{AE}}{\partial b_{e_j}}$, which also affects the expected
pre-activation of a unit; but this effect is largely unpredictable, depending
on the form of $\frac{\partial J_{AE}}{\partial b_{e_j}}$. In the next
section, we will connect the notions of expected pre-activation and sparsity
for activation functions with certain properties, which will extend the above
arguments to the sparsity of hidden units.
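
The update of the expected pre-activation follows from Assumption 1 in a few steps. The block below is a short worked sketch using only $\mathbb{E}_x[x] = 0$, $\mathbb{E}_x[xx^T] = I$ and a plain gradient step on the bias with learning rate $\eta$; the symbols follow equations (3) and (4).

```latex
\begin{aligned}
\mathbb{E}_x[a_j^{t}] &= W_j^{t}\,\mathbb{E}_x[x] + b_{e_j}^{t} = b_{e_j}^{t}, \\
b_{e_j}^{t+1} &= b_{e_j}^{t} - \eta\,\frac{\partial J_{RAE}}{\partial b_{e_j}}
  = b_{e_j}^{t} - \eta\left(\frac{\partial J_{AE}}{\partial b_{e_j}}
  + \lambda\,\frac{\partial \mathcal{R}}{\partial b_{e_j}}\right), \\
\mathrm{Var}[a_j^{t+1}] &= W_j^{t+1}\,\mathbb{E}_x[x x^T]\,(W_j^{t+1})^T
  = \|W_j^{t+1}\|^2 .
\end{aligned}
```

Since the expected pre-activation equals the bias, it decreases whenever the bracketed gradient is positive, which is what the sufficient condition of Theorem 1 guarantees when combined with the lower end of the interval in Lemma 1.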

Finally, in the relaxed case when weight vectors are not constrained to have a
fixed length, an upper bound on the weight vectors' length can easily be
guaranteed using Max-norm Regularization or Weight Decay, which are widely
used tricks while training deep networks (Hinton et al., 2012). In the former
case every weight vector is simply constrained to lie within an $\ell_2$ ball
($\|W_j\|_2 \leq c \;\; \forall j \in [m]$, where $c$ is a fixed constant)
after every gradient update.
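
Both weight-length constraints mentioned above amount to a simple row-wise rescaling applied after each gradient update. A minimal NumPy sketch follows; the function names and the small constant guarding against division by zero are choices made here for illustration.

```python
import numpy as np

def max_norm_project(W, c):
    """Rescale every row W_j with ||W_j||_2 > c back onto the l2 ball of radius c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

def unit_norm_rows(W):
    """Fix every row to unit length (the fixed-length case discussed above)."""
    return W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)

W = np.array([[3.0, 4.0], [0.3, 0.4]])
W_proj = max_norm_project(W, c=1.0)   # first row is rescaled, second is untouched
```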


Having shown the property of regularization functions that 
encourages lower pre-activations, we now introduce two 
classes of regularization functions that inherit this property 
and thus manifest the predictions made above. 


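Corollary 1. If $s_e$ is a non-decreasing activation function with first
derivative in $[0, 1]$ and
$\mathcal{R} = \sum_{j=1}^{m} f(\mathbb{E}_x[h_j])$ for any monotonically
increasing function $f(.)$, then $\exists \lambda > 0$ such that updating
$\{W^t, b_e^t\}$ along the negative gradient of $J_{RAE}$ results in
$\mathbb{E}_x[a_j^{t+1}] \leq \mathbb{E}_x[a_j^t]$ and
$\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \geq 0$.

Corollary 2. If $s_e$ is a non-decreasing convex activation function with
first derivative in $[0, 1]$ and
$\mathcal{R} = \mathbb{E}_x\left[\sum_{j=1}^{m} \left(\left(\frac{\partial h_j}{\partial a_j}\right)^q \|W_j\|_2^p\right)\right]$,
$q \in \mathbb{N}$, $p \in \mathbb{W}$, then $\exists \lambda > 0$ such that
updating $\{W^t, b_e^t\}$ along the negative gradient of $J_{RAE}$ results in
$\mathbb{E}_x[a_j^{t+1}] \leq \mathbb{E}_x[a_j^t]$ and
$\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \geq 0$.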


The above corollaries show that specific regularizations encourage the
pre-activation of every hidden unit in AEs to reduce on average, with
assumptions made only on the activation function and the first/second order
statistics of the data distribution. We will show in Section 2.2 that multiple
existing AEs have regularizations of the form above.


2.1.2. Which activation functions are good for 
Sparse Representation? 

The above analysis in general suggests that non-decreasing convex activation
functions encourage lower expected pre-activation for the regularizations in
both corollaries. Also note that a reduction in the expected pre-activation
value ($\mathbb{E}[a_j^t]$) does not necessarily imply a reduction in the
hidden unit value ($h_j$) and thus sparsity. However, these regularizations
become immediately useful if we consider non-decreasing activation functions
with negative saturation at 0, i.e., $\lim_{a \to -\infty} s_e(a) = 0$. Now a
lower average pre-activation value directly implies higher sparsity.

Before proceeding, we would like to mention that although the general notion
of sparsity in AEs entails that the majority of units are de-activated, i.e.,
their value is less than a certain threshold ($\delta_{min}$), in practice a
representation that is truly sparse (a large number of hard zeros) usually
yields better performance (Glorot et al., 2011; Wright et al., 2009; Yang et
al., 2009). Extending the argument of Theorem 1, we obtain:

Theorem 2. Let $p_j^t$ denote a lower bound of
$\Pr(h_j^t \leq \delta_{min})$ at iteration $t$ and $s_e(.)$ be a
non-decreasing function with first derivative in $[0, 1]$. If $\|W_j^t\|_2$ is
upper bounded independent of $\lambda$, then
$\exists S \subseteq \mathbb{R}^+$ and $\exists T_{min}, T_{max} \in \mathbb{N}$
such that $p_j^{t+1} \geq p_j^t$ $\forall \lambda \in S$,
$T_{min} \leq t \leq T_{max}$.

The above theorem formally connects the notions of expected pre-activation and
expected sparsity of a hidden unit. Specifically, it shows that the usage of
non-decreasing activation functions leads to lower expected pre-activation and
thus a higher probability of de-activated hidden units when Theorem 1 applies.
This result coupled with the property $\lim_{a \to -\infty} s_e(a) = 0$
(de-activated state) implies that the average sparsity of hidden units keeps
increasing after a sufficient number of iterations ($T_{min}$) for such
activations. Notice that convexity in $s_e(.)$ is only desired for the
regularizations in Corollary 2. Thus in summary, non-decreasing convex
$s_e(.)$ ensure that $\partial \mathcal{R} / \partial b_{e_j}$ is positive for
the regularizations in Corollaries 1 and 2, which in turn encourages low
expected pre-activation for suitable values of $\lambda$. This finally leads
to higher sparsity if $\lim_{a \to -\infty} s_e(a) = 0$.

Notice we derive the strict inequality
($\mathbb{E}_x[a_j^{t+1}] < \mathbb{E}_x[a_j^t]$) in Theorem 1 (and use it in
Theorem 2) even though the corollaries suggest that non-decreasing convex
activations imply the relaxed case
($\mathbb{E}_x[a_j^{t+1}] \leq \mathbb{E}_x[a_j^t]$). This is done for two
reasons: a) to ensure sparsity monotonically increases for iterations
$T_{min} \leq t \leq T_{max}$, and b) the condition
$\partial \mathcal{R} / \partial b_{e_j} = 0$ (which results in
$\mathbb{E}_x[a_j^{t+1}] \leq \mathbb{E}_x[a_j^t]$) is unlikely for
activations with non-zero first/second derivatives because the term
$\mathcal{R}$ (in the above corollaries) depends on the entire data
distribution.

The most popular choices of activation function are ReLU, Maxout (Goodfellow
et al., 2013), Sigmoid, Tanh and Softplus. Maxout and Tanh are not applicable
to our framework as they do not satisfy the negative saturation property.

ReLU: It is a non-decreasing convex function; thus both Corollaries 1 and 2
apply. Note that ReLU does not have a second derivative.¹ Thus, in practice,
this may lead to poor sparsity for the regularization in Corollary 2 due to
the lack of bias gradients from the regularization, i.e.,
$\partial \mathcal{R} / \partial b_{e_j} = 0$. On the flip side, the advantage
of ReLU is that it enforces hard zeros in the learned representations.

¹ In other words, $\partial^2 h_j / \partial a_j^2 = \delta(W_j x + b_{e_j})$,
where $\delta(.)$ is the Dirac delta function. Although strictly speaking
$\partial^2 h_j / \partial a_j^2$ is always non-negative, this value is zero
everywhere except when the argument is exactly 0, in which case it is
$+\infty$.

Softplus: It is a non-decreasing convex function and hence encourages sparsity
for the suggested AE regularizations. In contrast to ReLU, Softplus has
positive bias gradients (hence better sparsity for Corollary 2) because of its
smoothness. On the other hand, note that Softplus does not produce hard zeros
due to its asymptotic left saturation at 0.

Sigmoid: Corollary 1 applies unconditionally to Sigmoid, while Corollary 2
does not apply in general. Hence Sigmoid is not guaranteed to lead to sparsity
when used with regularizations of the form specified in Corollary 2.

Notice that all the above activation functions have their first derivative in
$[0, 1]$ (a condition required by Lemma 1). In conclusion, Maxout and Tanh do
not satisfy the negative saturation property at 0 and hence do not guarantee
sparsity; all others (ReLU, Softplus and Sigmoid) have properties (at least in
principle) that encourage sparsity in learned representations for the
suggested regularizations.
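
The properties discussed for each activation can be checked numerically on a grid. The sketch below uses finite differences on the interval [-20, 20]; the grid, tolerance and the `check` helper are illustrative choices made here, and Maxout is omitted because it is not a scalar function of a single pre-activation.

```python
import numpy as np

def relu(a):     return np.maximum(a, 0.0)
def softplus(a): return np.logaddexp(0.0, a)      # log(1 + exp(a)), numerically stable
def sigmoid(a):  return 1.0 / (1.0 + np.exp(-a))

def check(s, lo=-20.0, hi=20.0, num=4001, tol=1e-6):
    a = np.linspace(lo, hi, num)
    y = s(a)
    d1 = np.diff(y) / np.diff(a)                  # finite-difference slope
    return {
        "non_decreasing": bool(np.all(d1 >= -tol)),
        "slope_in_0_1": bool(np.all(d1 >= -tol) and np.all(d1 <= 1.0 + tol)),
        "convex": bool(np.all(np.diff(d1) >= -tol)),
        "saturates_at_0": bool(abs(s(np.array([-50.0]))[0]) < tol),
    }

results = {name: check(f) for name, f in
           [("relu", relu), ("softplus", softplus),
            ("sigmoid", sigmoid), ("tanh", np.tanh)]}
```

On this grid ReLU and Softplus pass all four checks, Sigmoid fails only the convexity check, and Tanh fails both convexity and negative saturation at 0, in line with the discussion above.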

2.2. Part II: Do existing Auto-Encoders learn Sparse 
Representation? 

At this point, a natural question to ask is whether existing AEs learn Sparse
Representation. To complete the loop, we show that most of the popular AE
objectives have a regularization term similar to what we have proposed in
Corollaries 1 and 2, and thus they indeed learn sparse representation.


1) De-noising Auto-Encoder (DAE): DAE (Vincent et al., 2008) aims at
minimizing the reconstruction error between every sample $x$ and the vector
reconstructed from its corresponding corrupted version $\tilde{x}$. The
corrupted version $\tilde{x}$ is sampled from a conditional distribution
$p(\tilde{x}|x)$. The original DAE objective is given by

$$J_{DAE} = \mathbb{E}_x\left[\mathbb{E}_{p(\tilde{x}|x)}\left[\ell(x, f_d(f_e(\tilde{x})))\right]\right] \tag{6}$$

where $p(\tilde{x}|x)$ denotes the conditional distribution of $\tilde{x}$
given $x$. Since the above objective is analytically intractable due to the
corruption process, we take a second order Taylor approximation of the DAE
objective around the distribution mean
$\mu_x = \mathbb{E}_{p(\tilde{x}|x)}[\tilde{x}]$ in order to overcome this
difficulty.

Theorem 3. Let $\{W, b_e\}$ represent the parameters of a DAE with squared
loss, linear decoding, and i.i.d. Gaussian corruption with zero mean and
$\sigma^2$ variance, at any point of training over data sampled from
distribution $\mathcal{D}$. Let $a_j := W_j x + b_{e_j}$ so that
$h_j = s_e(a_j)$ corresponding to sample $x \sim \mathcal{D}$. Then,


$$J_{DAE} = J_{AE} + \sigma^2\, \mathbb{E}_x\left[ \sum_{j=1}^{m} \left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^4 + \sum_{\substack{j,k=1 \\ j \neq k}}^{m} \frac{\partial h_j}{\partial a_j} \frac{\partial h_k}{\partial a_k} \left(W_j^T W_k\right)^2 + (b_d + W^T h - x)^T W^T \left(\frac{\partial^2 h}{\partial a^2} \odot (W \odot W)\mathbf{1}\right) \right] + o(\sigma^2) \tag{7}$$

where $\frac{\partial^2 h}{\partial a^2} \in \mathbb{R}^m$ is the element-wise
2nd derivative of $h$ w.r.t. $a$, $\odot$ is the element-wise product, and
$(W \odot W)\mathbf{1} \in \mathbb{R}^m$ is the vector of squared row norms
$\|W_j\|_2^2$.

The first term of the above regularization is of the form stated in
Corollary 2. Even though the second term does not have the exact suggested
form, it is straightforward to see that this term generates non-negative bias
gradients for non-decreasing convex activation functions (and should have
behaviour similar to that predicted in Corollary 2). Note that the last term
depends on the reconstruction error, which practically becomes small after a
few epochs of training, after which the other two regularization terms take
over. Besides, this term is usually ignored as it is not positive-definite.
This suggests that DAE is capable of learning sparse representation.

2) Contractive Auto-Encoder (CAE): CAE (Rifai et al., 2011b) objective is
given by

$$J_{CAE} = J_{AE} + \lambda\, \mathbb{E}_x\left[\|J(x)\|_F^2\right] \tag{8}$$

where $J(x) = \frac{\partial h}{\partial x}$ denotes the Jacobian matrix and
the objective aims at minimizing the sensitivity of the hidden representation
to slight changes in input.

Remark 1. Let $\{W, b_e\}$ represent the parameters of a CAE with
regularization coefficient $\lambda$, at any point of training over data
sampled from some distribution $\mathcal{D}$. Then,

$$J_{CAE} = J_{AE} + \lambda\, \mathbb{E}_x\left[\sum_{j=1}^{m} \left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^2\right] \tag{9}$$

Thus CAE regularization also has a form identical to the form suggested in
Corollary 2. Thus the hidden representation learned by CAE should also be
sparse. In addition, since the first order regularization term in Higher order
CAE (CAE+H) (Rifai et al., 2011a) is the same as in CAE, this suggests that
the CAE+H objective should have similar properties in terms of sparsity.

3) Marginalized De-noising Auto-Encoder (mDAE): mDAE (Chen et al., 2014)
objective is given by:

$$J_{mDAE} = J_{AE} + \frac{1}{2}\, \mathbb{E}_x\left[\sum_{i=1}^{n} \sigma_{x_i}^2 \sum_{j=1}^{m} \frac{\partial^2 \ell}{\partial h_j^2} \left(\frac{\partial h_j}{\partial \tilde{x}_i}\right)^2\right] \tag{10}$$

where $\sigma_{x_i}^2$ denotes the corruption variance intended for the $i$th
input dimension. The authors of mDAE proposed this algorithm with the primary
goal of speeding up the training of DAE by deriving an approximate form that
omits the need to iterate over a large number of explicitly corrupted
instances of every training sample.

Remark 2. Let $\{W, b_e\}$ represent the parameters of a mDAE with linear
decoding, squared loss and $\sigma_{x_i}^2 = \lambda$ $\forall i$, at any
point of training over data sampled from some distribution $\mathcal{D}$.
Then,

$$J_{mDAE} = J_{AE} + \lambda\, \mathbb{E}_x\left[\sum_{j=1}^{m} \left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^4\right] \tag{11}$$

Apart from justifying sparsity in the above AEs, these equivalences also
expose the similarity between DAE, CAE and mDAE regularization, as they all
follow the form in Corollary 2. Note how the goals of achieving invariance in
the hidden and original representation, respectively in CAE and mDAE, show up
as a mere factor of weight length in their regularization in the case of
linear decoding.
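
The equivalence between the CAE penalty and a weighted sum of squared weight lengths can be verified numerically for a single sample. The sketch below assumes a sigmoid encoder (so that $\partial h_j / \partial a_j = h_j(1 - h_j)$) and uses hypothetical helper names; the exponent `p = 4` corresponds to the mDAE form under linear decoding and squared loss.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cae_penalty_jacobian(x, W, b_e):
    """||J(x)||_F^2 with J = dh/dx = diag(s_e'(a)) W for a sigmoid encoder."""
    h = sigmoid(W @ x + b_e)
    J = (h * (1.0 - h))[:, None] * W          # row j is s_e'(a_j) * W_j
    return np.sum(J ** 2)

def weighted_norm_penalty(x, W, b_e, p):
    """sum_j (dh_j/da_j)^2 * ||W_j||_2^p (p = 2 for CAE, p = 4 for mDAE)."""
    h = sigmoid(W @ x + b_e)
    return np.sum((h * (1.0 - h)) ** 2 * np.linalg.norm(W, axis=1) ** p)

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
b_e = rng.standard_normal(6)
x = rng.standard_normal(4)
cae_direct = cae_penalty_jacobian(x, W, b_e)
cae_closed = weighted_norm_penalty(x, W, b_e, p=2)
```

When every row of `W` is rescaled to unit length, the `p = 2` and `p = 4` penalties coincide, so the two regularizers differ only through the weight lengths.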


4) Sparse Auto-Encoder (SAE): Sparse AEs are given by:

$$J_{SAE} = J_{AE} + \lambda \sum_{j=1}^{m} \left(\rho \log\frac{\rho}{\rho_j} + (1 - \rho) \log\frac{1 - \rho}{1 - \rho_j}\right) \tag{12}$$

where $\rho_j = \mathbb{E}_x[h_j]$ and $\rho$ is the desired average
activation (typically close to 0). Thus SAE requires one additional parameter
($\rho$) that needs to be pre-determined. To make SAE follow our paradigm, we
set $\rho = 0$, and thus tuning the value of $\lambda$ automatically enforces
a balance between the final level of average sparsity and reconstruction
error. Thus the SAE objective becomes

$$J_{SAE} = J_{AE} - \lambda \sum_{j=1}^{m} \log(1 - \rho_j) \quad (\text{when } \rho = 0) \tag{13}$$

Note that for small values of $\rho_j$, $\log(1 - \rho_j) \approx -\rho_j$.
Thus the above objective has a very close resemblance to sparse coding
(equation 2), except that SC has a non-parametric encoder. On the other hand,
the above regularization has the form specified in Corollary 1, which we have
shown enforces sparsity. Thus, although the SAE regularization is expected to
enforce sparsity from an intuitive standpoint, our results show that it indeed
does so from a more theoretical perspective.
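
The SAE penalty and its $\rho = 0$ limit can be written down directly. The sketch below is illustrative: `rho_hat` stands for the vector of mean activations $\rho_j$, the function names are hypothetical, and the general penalty assumes all arguments lie strictly between 0 and 1.

```python
import numpy as np

def sae_penalty(rho_hat, rho):
    """KL-style penalty of eq. (12) for 0 < rho < 1 and 0 < rho_hat < 1."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sae_penalty_rho0(rho_hat):
    """Limit rho -> 0 of the penalty, eq. (13): -sum_j log(1 - rho_hat_j)."""
    return -np.sum(np.log1p(-np.asarray(rho_hat, dtype=float)))

rho_hat = np.array([0.01, 0.02, 0.005])       # small mean activations E_x[h_j]
exact = sae_penalty_rho0(rho_hat)
approx = rho_hat.sum()                        # first-order: -log(1 - r) ~ r
```

For small mean activations the penalty is close to the plain sum of activations, which is the resemblance to the $\ell_1$ term of sparse coding noted above.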

3. Empirical Analysis and Observations 

We use the following two datasets for our experiments:

1. MNIST (Lecun & Cortes): It is a 10 class dataset of handwritten digit
images, of which 50,000 images are provided for training.

2. CIFAR-10 (Krizhevsky, 2009): It consists of 60,000 32 x 32 color images of
objects in 10 classes. For CIFAR-10, we randomly crop 50,000 patches of size
8 x 8 for training the auto-encoders.

Experimental Protocols: Since neural network (NN) optimization is non-convex,
training with different optimization conditions (e.g., learning rate, data
scale and mean, gradient update scheme, etc.) can lead to drastically
different outcomes. Indeed, one of the things that makes training NNs
difficult is the need for well-designed optimization strategies, without which
they do not learn useful features. Our analysis is based on certain
assumptions on the data distribution and conditions on weight matrices. Thus,
in order to empirically verify our analysis, we use the following experimental
protocols that make the optimization well conditioned.

For all experiments, we use mini-batch stochastic gradient descent with
momentum (0.9) for optimization, 50 epochs, batch size 50 and 1000 hidden
units. We train DAE, CAE, mDAE and SAE (using eq. 13) with the same
hyper-parameters for all the experiments. For the regularization coefficient
($\sigma^2$), we use the values in the set
$\{0, 0.001, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.6^2, 0.7^2, 0.8^2, 0.9^2, 1.0\}$
for all models, except that for DAE the $\sigma^2$ values represent the
variance of the Gaussian noise added. For all models and activation functions,
we use squared loss and linear decoding. We initialize the bias to zeros and
use normalized initialization (Glorot & Bengio, 2010) for the weights.
Further, we subtract the mean and divide by the standard deviation for all
samples.²
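
The numerical parts of this protocol can be sketched as follows. The sketch is hedged in several ways: computing the statistics per input dimension is an assumption (the text does not say whether normalization is per dimension or per sample), `std_offset` is a hypothetical parameter name for the constant added to the standard deviation, and the momentum update shown is the classical form rather than a confirmed detail of the authors' implementation.

```python
import numpy as np

# Regularization coefficients sigma^2 listed in the protocol.
sigma2_grid = [0.0, 0.001] + [round((k / 10) ** 2, 4) for k in range(1, 10)] + [1.0]

def standardize(X, std_offset=0.0):
    """Subtract the mean and divide by (standard deviation + std_offset).

    Statistics are taken per input dimension here, which is an assumption.
    """
    return (X - X.mean(axis=0)) / (X.std(axis=0) + std_offset)

def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One classical momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

rng = np.random.default_rng(0)
X = 5.0 * rng.standard_normal((200, 8)) + 2.0
Xs = standardize(X)
new_param, new_velocity = sgd_momentum_step(1.0, 2.0, 0.0)
```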

Learning Rate (LR): Too small an LR will not move the weights from their
initialized region, and the convergence will be very slow. On the other hand,
if we use too large a learning rate, it will change the weight direction very
drastically (and may diverge), something we do not desire for our predictions
to hold. So, we find a middle ground and choose an LR in the range
(0.001, 0.005) for our experiments.

² We noticed that in the case of MNIST, it is important to add a large number
(0.1) to the standard deviation before dividing. We believe this is because
MNIST (being binary images with uniform background) does not follow our
assumption on the data distribution.

Terminology: We are interested in analysing the sparsity of hidden units as a
function of the regularization coefficient $\sigma^2$ throughout our
experiments. Recall that our notion of sparsity (Section 2.1) is defined by
the fraction of data samples that deactivate a hidden unit instead of the
fraction of hidden units that deactivate for a given data sample. This choice
was made in order to treat each hidden unit as a random variable. Since we
cannot identify a particular hidden unit across auto-encoders trained with
different values of $\sigma^2$, the only way of measuring the level of
sparsity in auto-encoder units is to compute the Average Activation Fraction,
which is defined as follows:

$$\text{Avg. Act. Fraction} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{m} \mathbb{1}(h_j^i > \delta_{min})}{N \times m} \tag{14}$$

Here $\mathbb{1}(.)$ is the indicator operator, $h_j^i$ denotes the $j$th
hidden unit for the $i$th data sample, and $\delta_{min}$ is the activation
threshold. In the case of ReLU, $\delta_{min} = 0$, and in the case of Sigmoid
and Softplus, $\delta_{min} = 0.1$. Also, $N$ and $m$ denote the total number
of data samples and the number of hidden units respectively. Notice that the
sparsity of a hidden unit is inversely related to the average activation
fraction for that single unit. Thus our definition of Avg. Activation Fraction
is an indicator of average sparsity across all hidden units. Finally, while
measuring Avg. Activation Fraction during training, we also keep track of the
fraction of dead units. Dead units are those hidden units which deactivate for
all data samples and are thus unused by the network for data reconstruction.
Notice that while achieving sparsity, it is desired that as few hidden units
as possible are dead and that all alive units activate only for a small
fraction of data samples.
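
Both quantities tracked in the experiments are simple reductions over the matrix of hidden activations. A minimal sketch, assuming the activations are stored as an array `H` of shape (N, m); the function names are illustrative.

```python
import numpy as np

def avg_activation_fraction(H, delta_min):
    """Eq. (14): fraction of (sample, unit) pairs with h > delta_min; H is (N, m)."""
    return np.mean(H > delta_min)

def dead_unit_fraction(H, delta_min):
    """Fraction of hidden units that stay de-activated for every sample."""
    return np.mean(np.all(H <= delta_min, axis=0))

H = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.5, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])     # N = 4 samples, m = 3 ReLU units
act = avg_activation_fraction(H, delta_min=0.0)   # 4 active entries out of 12
dead = dead_unit_fraction(H, delta_min=0.0)       # third unit never activates
```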

3.1. Sparsity when Bias Gradient is zero 

One of the main predictions made based on Theorem 1 is that the sparsity of
hidden units should remain unchanged with respect to $\sigma^2$ when the bias
gradient $\frac{\partial \mathcal{R}}{\partial b_{e_j}} = 0$ and weight
lengths are fixed to a pre-determined value, because the expected
pre-activation becomes completely independent of $\sigma^2$. Notice that this
prediction only accounts for a change in sparsity as a result of a change in
the expected pre-activation of the corresponding unit. Sparsity can also
increase when the expected pre-activation for that unit is fixed, as a result
of a change in weight directions such that the majority of samples take
pre-activation values below the activation threshold while the minority takes
values above it, such that the overall expected value remains unchanged. This
change in weight directions is also affected by $\sigma^2$ since the
regularization functions specified in Corollaries 1 and 2 contain both weight
and bias terms. However, the latter factor contributing to the change in
sparsity is unpredictable in terms of changing $\sigma^2$ values. Hence, for
better predictive power, it is desired that sparsity be largely affected only
when the bias gradient is present.





Why Regularized Auto-Encoders Learn Sparse Representation?

[Plot panels: "ReLU on MNIST with $\|W_j\|_2 = 1$, $j \in \{1, 2, \ldots, m\}$" and "ReLU on CIFAR-10 with $\|W_j\|_2 = 1$, $j \in \{1, 2, \ldots, m\}$". Curves: mDAE act, mDAE dead, DAE act, DAE dead.]

[Plot panel: "ReLU on CIFAR-10". Curves: mDAE act, mDAE dead, DAE act, DAE dead, CAE act, CAE dead.]


Figure 1. Trend of average activation fraction vs. $\sigma^2$ with weight
length constraint using ReLU on MNIST (left) and CIFAR-10 (right).


Hence we analyse the effect of the regularization coefficient ($\sigma^2$) on
the sparsity of representations learned by AE models using the ReLU activation
function with weight lengths constrained to be one. Notice that ReLU has zero
bias gradient not only for CAE and mDAE, but also for the equivalent
regularization derived for DAE (Theorem 3). The plots are shown in Figure 1.³

We see that the effect of the bias gradient largely dominates the behaviour of
hidden units in terms of sparsity. Specifically, as predicted, the average
activation fraction (and thus sparsity) remains unchanged with respect to the
regularization coefficient $\sigma^2$ when ReLU is applied to CAE and mDAE,
due to the absence of a bias gradient.

We also analyse the effect of the regularization coefficient ($\sigma^2$) on
the sparsity of representations learned by AE models using ReLU activation
functions when weight lengths are not constrained. These plots can be seen in
Figure 2. We find that the trend becomes unpredictable for both CAE and mDAE
(the two datasets show different trends). As discussed after Theorem 1,
without the weight length constraint, $\sigma^2$ affects the weight length,
which in turn affects $\frac{\partial J_{AE}}{\partial b_{e_j}}$, which
changes the value of the expected pre-activation. However, this effect is
unpredictable and thus undesired.

On the other hand, we see that for DAE, in the constrained length case (figure 1), the number of dead units starts rising only after the average activation fraction reaches around 0.05. However, in the case of unconstrained weight length, ReLU does not go below an average activation fraction of 0.1. This shows that constrained weight length achieves a higher level of sparsity before giving rise to dead units.

In summary, we find that bias gradient dominates the behaviour of hidden units in terms of sparsity. Also, these experiments suggest we get both more predictive power and better sparsity with hidden weights constrained to have fixed (unit) length. Notice this does not restrict the usefulness of the representation learned by auto-encoders, since we are only interested in the filter shapes, and not their scale.
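
The two quantities tracked in these experiments, the average activation fraction and the fraction of dead units, together with the unit-length weight constraint, can be sketched as follows. This is a minimal illustration and not the experimental code: the function names, the threshold `eps`, and the definition of a dead unit as one that is inactive on every sample are assumptions made here.

```python
import math

def normalize_rows(W):
    """Rescale each weight vector (row of W) to unit Euclidean length.

    Assumes no row is all zeros.
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        out.append([w / norm for w in row])
    return out

def activation_stats(hidden, eps=0.0):
    """Return (average activation fraction, fraction of dead units).

    hidden: list of samples, each a list of m hidden-unit outputs.
    A unit counts as active on a sample when its output exceeds eps
    (assumed definition); a unit is dead if it is never active.
    """
    n_samples, m = len(hidden), len(hidden[0])
    per_unit = [sum(1 for h in hidden if h[j] > eps) / n_samples
                for j in range(m)]
    avg_fraction = sum(per_unit) / m
    dead_fraction = sum(1 for f in per_unit if f == 0.0) / m
    return avg_fraction, dead_fraction

# Toy ReLU outputs for 2 samples and 3 hidden units (made-up numbers).
hidden = [[0.0, 1.2, 0.0],
          [0.0, 0.3, 0.5]]
avg_fraction, dead_fraction = activation_stats(hidden)

# Unit-length constraint applied to a toy 2 x 2 weight matrix.
W_unit = normalize_rows([[3.0, 4.0], [0.0, 2.0]])
```

In this toy input the first unit is never active, the second is always active and the third is active on half of the samples, so one of the three units counts as dead.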

3 For weight length constrained to 1, CAE and mDAE objectives become equivalent.


Figure 2. Trend of average activation fraction vs. $\sigma^2$ without weight length constraint using ReLU on MNIST (left) and CIFAR-10 (right).

3.1.1. Why is DAE affected by $\sigma^2$ when ReLU has zero bias gradient?

The surprising part of the above experiments is that DAE shows a stable trend of decreasing activation fraction, i.e. increasing sparsity, (across different values of $\sigma^2$) for ReLU, although DAE (similar to CAE and mDAE) has a regularization form given in corollary 2. The fact that ReLU practically does not generate bias gradients from this form of regularization brings our attention to an interesting possibility: ReLU is generating the positive bias gradient due to the first order regularization term in DAE. Recall that we marginalize out the first order term in DAE (during Taylor's expansion, see proof of theorem 3) while taking expectation over all corrupted versions of a training sample. However, the mathematically equivalent objective of DAE obtained by this analytical marginalization is not what we optimize in practice. While optimizing with explicit corruption in a batch-wise manner, we indeed get a non-zero first order term, which does not vanish due to finite sampling (of corrupted versions), thus explaining sparsity for ReLU. We test this hypothesis by optimizing the explicit Taylor's expansion of DAE (eDAE) with only the first order term on MNIST and CIFAR-10 using our standard experimental protocols:

$$J_{eDAE} = \mathbb{E}_x\left[\ell(x, f_d(f_e(x))) + (\tilde{x} - x)^T \nabla_{\tilde{x}}\ell\right]$$

where $\tilde{x}$ is a Gaussian corrupted version of $x$. The activation fraction vs. corruption variance ($\sigma^2$) for eDAE is shown in figure 3, which confirms that the first order term contributes towards sparsity. On a more general note, lower order terms (in Taylor's expansion) of highly non-linear functions generally change slower (and are hence less sensitive) compared to higher order terms. In conclusion, we find that explicit corruption may have advantages at times compared to marginalization because it captures the effect of both lower and higher order terms together.
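
The finite-sampling argument above can be illustrated numerically. The sketch below is a simplification made for illustration only: it treats the gradient $\nabla_{\tilde{x}}\ell$ as a fixed, made-up vector `grad` and averages the first order term $(\tilde{x} - x)^T \nabla_{\tilde{x}}\ell$ over `k` Gaussian corruptions. The function name and all numbers are hypothetical.

```python
import random

def first_order_term(grad, sigma, k, rng):
    """Monte Carlo average of (x_tilde - x)^T grad over k corruptions.

    Each corruption x_tilde - x has i.i.d. zero-mean Gaussian entries
    with standard deviation sigma; grad is held fixed (an assumption).
    """
    total = 0.0
    for _ in range(k):
        total += sum(rng.gauss(0.0, sigma) * g for g in grad)
    return total / k

rng = random.Random(0)
grad = [0.5, -1.0, 2.0]  # hypothetical fixed gradient

# A handful of corruptions, as in a small batch: generally non-zero.
few_samples = first_order_term(grad, 0.5, 4, rng)

# Many corruptions: approaches the marginalized value, which is zero.
many_samples = first_order_term(grad, 0.5, 200000, rng)
```

With only a few corrupted samples the average is generally not zero, whereas with many samples it approaches the zero value that analytical marginalization assigns to this term.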



Figure 3. Activation fraction vs. $\sigma^2$ for eDAE.

Figure 4. Trend of average activation fraction vs. $\sigma^2$ with weight length constraint using Sigmoid activation on MNIST (left) and CIFAR-10 (right).


Figure 5. Trend of average activation fraction vs. $\sigma^2$ without weight length constraint using Sigmoid activation on MNIST (left) and CIFAR-10 (right).


3.2. Sparsity when Bias Gradient is positive 

As predicted by theorem 1, if the bias gradient is strictly positive ($\frac{\partial \mathcal{R}}{\partial b_{e_j}} > 0$), then increasing the value of $\sigma^2$ should lead to smaller expected pre-activation and thus increasing sparsity. This is especially true when the weight lengths are fixed to some length. This is because the $\frac{\partial \mathcal{R}}{\partial b_{e_j}}$ term may depend on the weight length (depending on the regularization), which is also affected by $\sigma^2$. However, since this effect is hard to predict, sparsity may not always be proportional to $\sigma^2$ for unconstrained weight length.

In order to verify these intuitions, we first analyse the effect of the regularization coefficient ($\sigma^2$) on the sparsity of representations learned by AE models using the Sigmoid⁴,⁵ activation function with weight lengths constrained to one. The plots are shown in figure 4. These plots show a stable increasing sparsity trend with increasing regularization coefficient, as predicted by our analysis.

Finally, we analyse the effect of the regularization coefficient ($\sigma^2$) on the sparsity of representations learned by AE models using the Sigmoid activation function when weight lengths are unconstrained. These plots are shown in figure 5. As mentioned above, unconstrained weight length leads to unpredictable behaviour of sparsity with respect to the regularization coefficient. This can be seen for mDAE and CAE, which show different trends on the two datasets.

In summary, we again find that weight lengths constrained to have some fixed value lead to better predictive power in terms of sparsity. However, in either case, the empirical observations substantiate our claim that sparsity in auto-encoders is dominated by the effect of bias gradient from regularization instead of weight direction. This explains why existing regularized auto-encoders learn sparse representations, and how the regularization coefficient affects sparsity.

4 Due to lack of space and because Softplus had trends similar to Sigmoid, we don't show its plots.

5 Although Sigmoid only guarantees sparsity for regularizations in corollary 1 (e.g. SAE), we find it behaves similarly for corollary 2 (e.g. mDAE, CAE).


4. Conclusion and Discussion 

We establish a formal connection between features learned by regularized auto-encoders and sparse representation. Our contribution is multi-fold. We show that: a) AE regularizations with positive encoding bias gradient encourage sparsity (theorem 1), while those with zero bias gradient are not affected by the regularization coefficient; b) activation functions which are non-decreasing, with negative saturation at zero, encourage sparsity for such regularizations (theorem 2), and multiple existing activations have this property (e.g. ReLU, Softplus and Sigmoid); c) existing AEs have regularizations of the form suggested in corollaries 1 and 2, which not only brings them under a unified framework, but also shows more general forms of regularizations that encourage sparsity.

On the empirical side: a) bias gradient dominates the effect on sparsity of hidden units; specifically, sparsity in general increases with the regularization coefficient when the bias gradient is positive and remains unaffected when it is zero (section 3); b) constraining the weight vectors during optimization to have fixed length leads to better sparsity and to behaviour as predicted by our analysis. Notice this does not restrict the usefulness of the representation learned by auto-encoders, since we are only interested in the filter shapes (weight direction), and not their scale. On the flip side, without the length constraint, the behaviour of auto-encoders w.r.t. the regularization coefficient becomes unpredictable in some cases; c) explicit corruption (e.g. DAE) may have advantages over marginalizing it out (e.g. mDAE, see section 3.1.1) because it captures both first and second order effects.

In conclusion, our analysis not only unifies existing AEs and activation functions by bringing them under a single framework, but also uncovers more general forms of regularizations and fundamental properties that encourage sparsity in hidden representation. Our analysis also yields new insights into AEs and provides novel tools for analysing existing (and new) regularization/activation functions that help predict whether the resulting AE learns sparse representations.
A. Supplementary Material

A.1. Supplementary Proofs

Lemma 1. If assumption 1 is true, and encoding activation function $s_e(.)$ has first derivative in $[0, 1]$, then $\frac{\partial J_{AE}}{\partial b_{e_j}} \in \left[-2\sigma_r\sqrt{n}\,\|W_j\|,\ 2\sigma_r\sqrt{n}\,\|W_j\|\right]$.


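Proof. For squared loss function $J_{AE}$,

$$\frac{\partial J_{AE}}{\partial b_{e_j}} = 2\,\mathbb{E}_x\left[\frac{\partial s_e(a_j)}{\partial a_j}\left(x - W^T s_e(Wx + b_e)\right)^T W_j\right] = 2\,\mathbb{E}_x\left[\frac{\partial s_e(a_j)}{\partial a_j}\,\mathbf{r}_x^T W_j\right] \tag{15}$$

where $a_j = W_j x + b_{e_j}$. Since $\frac{\partial s_e(a_j)}{\partial a_j} \in [0, 1]$,

$$\mathbb{E}_x\left[\frac{\partial s_e(a_j)}{\partial a_j}\,\mathbf{r}_x^T W_j\right] \le \mathbb{E}_x\left[\left|\frac{\partial s_e(a_j)}{\partial a_j}\right| \|\mathbf{r}_x\|\,\|W_j\|\right] \le \|W_j\|\,\mathbb{E}_x\left[\|\mathbf{r}_x\|\right] \tag{16}$$

Let $r_x$ denote any one of the elements of $\mathbf{r}_x$. Since each element of $\mathbf{r}_x$ is i.i.d. from assumption 1 and $\mathbf{r}_x \in \mathbb{R}^n$, using Jensen's inequality, $\mathbb{E}_x[\|\mathbf{r}_x\|_2] \le \sqrt{n\,\mathbb{E}_x[r_x^2]} = \sqrt{n}\,\sigma_r$. Thus,

$$\mathbb{E}_x\left[\frac{\partial s_e(a_j)}{\partial a_j}\,\mathbf{r}_x^T W_j\right] \le \sqrt{n}\,\sigma_r\,\|W_j\| \tag{17}$$

which leads to $\frac{\partial J_{AE}}{\partial b_{e_j}} \le 2\sigma_r\sqrt{n}\,\|W_j\|$. We can similarly prove the bound in the other direction to get the desired result. □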
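
The Jensen step in the proof of lemma 1, $\mathbb{E}_x[\|\mathbf{r}_x\|_2] \le \sqrt{n}\,\sigma_r$, can be checked with a small Monte Carlo estimate. The sketch below assumes residual entries that are i.i.d. with zero mean and standard deviation $\sigma_r$; the Gaussian distribution, the function name and the numbers are choices made here for illustration only.

```python
import math
import random

def mean_residual_norm(n, sigma_r, n_samples, rng):
    """Estimate E[||r||_2] for r with n i.i.d. N(0, sigma_r^2) entries."""
    total = 0.0
    for _ in range(n_samples):
        r = [rng.gauss(0.0, sigma_r) for _ in range(n)]
        total += math.sqrt(sum(v * v for v in r))
    return total / n_samples

rng = random.Random(0)
n, sigma_r = 10, 0.3

estimate = mean_residual_norm(n, sigma_r, 20000, rng)
jensen_bound = math.sqrt(n) * sigma_r  # sqrt(n) * sigma_r from the lemma
```

The estimate stays below the bound, and the gap narrows as $n$ grows, since the norm of such a vector concentrates around $\sqrt{n}\,\sigma_r$.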

Theorem 1. Let $\{W^t \in \mathbb{R}^{m \times n}, b_e^t \in \mathbb{R}^m\}$ be the parameters of a regularized auto-encoder ($\lambda > 0$)

$$J_{RAE} = J_{AE} + \lambda \mathcal{R}(W, b_e) \tag{18}$$

at training iteration $t$ with regularization term $\mathcal{R}(W, b_e)$, activation function $s_e(.)$ and define pre-activation $a_j^t = W_j^t x + b_{e_j}^t$ (thus $h_j^t = s_e(a_j^t)$). If $\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} > 2\sigma_r\sqrt{n}\,\|W_j\|$, where $j \in \{1, 2, \ldots, m\}$, then updating $\{W^t, b_e^t\}$ along the negative gradient of $J_{RAE}$ results in $\mathbb{E}_x[a_j^{t+1}] \le \mathbb{E}_x[a_j^t]$ and $\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \ge 0$.


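Proof. At iteration $t + 1$,

$$a_j^{t+1} = a_j^t - \eta \frac{\partial J_{RAE}}{\partial W_j} x - \eta \frac{\partial J_{RAE}}{\partial b_{e_j}} \tag{19}$$

for any step size $\eta$. Expanding $J_{RAE}$, we get,

$$a_j^{t+1} = a_j^t - \eta \frac{\partial J_{AE}}{\partial W_j} x - \eta \frac{\partial J_{AE}}{\partial b_{e_j}} - \eta\lambda \frac{\partial \mathcal{R}}{\partial W_j} x - \eta\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} \tag{20}$$

Thus taking expectation over $x$ on both sides we get,

$$\mathbb{E}_x[a_j^{t+1}] = \mathbb{E}_x[a_j^t] - \eta \frac{\partial J_{AE}}{\partial b_{e_j}} - \eta\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} \tag{21}$$

Notice the terms containing $\frac{\partial J_{AE}}{\partial W_j}$ and $\frac{\partial \mathcal{R}}{\partial W_j}$ in equation 20 disappear because both terms are already a function of expectation over $x$ (see various auto-encoder regularizations) when we deal with the expected cost function. Thus these terms are linear in $x$ and hence taking an expectation results in 0.

From lemma 1, $\frac{\partial J_{AE}}{\partial b_{e_j}} \ge -2\sigma_r\sqrt{n}\,\|W_j\|$, thus if $\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} > 2\sigma_r\sqrt{n}\,\|W_j\|$, then $\mathbb{E}_x[a_j^{t+1}] \le \mathbb{E}_x[a_j^t]$.

Finally, $\mathrm{Var}[a_j^{t+1}] = \mathbb{E}_x\left[\left(a_j^{t+1} - \mathbb{E}_x[a_j^{t+1}]\right)^2\right] = \mathbb{E}_x\left[\left(W_j^{t+1} x\right)^2\right] = \|W_j^{t+1}\|^2$. □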
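
The role of the threshold on $\lambda$ in theorem 1 can be made concrete with the expected update in equation (21). The sketch below plugs in the worst case allowed by lemma 1 for the reconstruction-loss gradient and compares a coefficient above the threshold with one below it. All numerical values, function names and the value used for the regularizer's bias gradient are hypothetical.

```python
import math

def lemma1_bound(sigma_r, n, w_norm):
    """Magnitude bound 2 * sigma_r * sqrt(n) * ||W_j|| from lemma 1."""
    return 2.0 * sigma_r * math.sqrt(n) * w_norm

def next_expected_preactivation(mean_a, eta, g_ae, lam, g_reg):
    """Equation (21): E[a^{t+1}] = E[a^t] - eta*g_ae - eta*lam*g_reg."""
    return mean_a - eta * g_ae - eta * lam * g_reg

bound = lemma1_bound(sigma_r=0.1, n=784, w_norm=1.0)
eta = 0.01
g_reg = 0.2              # assumed positive bias gradient of the regularizer
worst_case_g_ae = -bound  # loss gradient pushing the pre-activation up

lam_large = 1.5 * bound / g_reg  # lam * g_reg exceeds the bound
lam_small = 0.5 * bound / g_reg  # lam * g_reg is below the bound

after_large = next_expected_preactivation(0.0, eta, worst_case_g_ae,
                                           lam_large, g_reg)
after_small = next_expected_preactivation(0.0, eta, worst_case_g_ae,
                                           lam_small, g_reg)
```

Starting from an expected pre-activation of zero, the larger coefficient yields a decrease even in the worst case, while the smaller one leaves room for an increase.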
Corollary 1. If $s_e$ is a non-decreasing activation function with first derivative in $[0, 1]$ and $\mathcal{R} = \sum_{j=1}^{m} f(\mathbb{E}_x[h_j])$ for any monotonically increasing function $f(.)$, then $\exists \lambda > 0$ such that updating $\{W^t, b_e^t\}$ along the negative gradient of $J_{RAE}$ results in $\mathbb{E}_x[a_j^{t+1}] \le \mathbb{E}_x[a_j^t]$ and $\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \ge 0$.


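Proof. We need one additional argument other than theorem 1. By the chain rule,

$$\frac{\partial \mathcal{R}}{\partial b_{e_j}} = \frac{\partial f(\mathbb{E}_x[h_j])}{\partial \mathbb{E}_x[h_j]}\ \mathbb{E}_x\left[\frac{\partial h_j}{\partial a_j}\right]$$

Since both $s_e(.)$ and $f(.)$ are non-decreasing functions, $\frac{\partial \mathcal{R}}{\partial b_{e_j}} \ge 0$ in all cases. □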






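Corollary 2. If $s_e$ is a non-decreasing convex activation function with first derivative in $[0, 1]$ and $\mathcal{R} = \mathbb{E}_x\left[\sum_{j=1}^{m}\left(\frac{\partial h_j}{\partial a_j}\right)^q \|W_j\|_2^p\right]$, $q \in \mathbb{N}$, $p \in \mathbb{W}$, then $\exists \lambda > 0$ such that updating $\{W^t, b_e^t\}$ along the negative gradient of $J_{RAE}$ results in $\mathbb{E}_x[a_j^{t+1}] \le \mathbb{E}_x[a_j^t]$ and $\mathrm{Var}[a_j^{t+1}] = \|W_j^{t+1}\|^2$ for all $t \ge 0$.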


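Proof. We need one additional argument other than theorem 1.

$$\frac{\partial \mathcal{R}}{\partial b_{e_j}} = \mathbb{E}_x\left[q\left(\frac{\partial h_j}{\partial a_j}\right)^{q-1} \frac{\partial^2 h_j}{\partial a_j^2}\,\frac{\partial a_j}{\partial b_{e_j}}\,\|W_j\|_2^p\right]$$

Since $s_e(.)$ is a non-decreasing convex function, both $\frac{\partial^2 s_e(a_j)}{\partial a_j^2} \ge 0$ and $\frac{\partial s_e(a_j)}{\partial a_j} \ge 0$ $\forall a_j \in \mathbb{R}$. Finally, $\frac{\partial a_j}{\partial b_{e_j}} = 1$ by definition. Thus $\frac{\partial \mathcal{R}}{\partial b_{e_j}} \ge 0$ in all cases. □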
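
The sign argument in the proof of corollary 2 can be illustrated by evaluating the bias gradient for two activations. The sketch below assumes a per-unit regularizer of the form $\mathbb{E}_x[(\partial h_j/\partial a_j)^q \|W_j\|_2^p]$ and made-up Gaussian pre-activations; it uses Softplus as a convex, non-decreasing activation with derivative in $[0, 1]$, and ReLU with its second derivative taken as zero away from the kink. Function names and sample values are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softplus_d1(a):          # first derivative of softplus
    return sigmoid(a)

def softplus_d2(a):          # second derivative of softplus
    s = sigmoid(a)
    return s * (1.0 - s)

def relu_d1(a):
    return 1.0 if a > 0 else 0.0

def relu_d2(a):              # zero everywhere except at the kink
    return 0.0

def bias_gradient(pre_acts, d1, d2, w_norm, q=2, p=2):
    """Mean of q * (dh/da)^(q-1) * (d2h/da2) * ||W_j||^p over samples."""
    terms = [q * d1(a) ** (q - 1) * d2(a) * w_norm ** p for a in pre_acts]
    return sum(terms) / len(terms)

def regularizer(pre_acts, d1, w_norm, shift, q=2, p=2):
    """Mean of (dh/da)^q * ||W_j||^p with the bias moved by `shift`."""
    vals = [d1(a + shift) ** q for a in pre_acts]
    return w_norm ** p * sum(vals) / len(vals)

rng = random.Random(0)
pre_acts = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # made-up samples

grad_relu = bias_gradient(pre_acts, relu_d1, relu_d2, 1.0)
grad_softplus = bias_gradient(pre_acts, softplus_d1, softplus_d2, 1.0)

# Central finite difference in the bias, as a check on the formula.
step = 1e-5
finite_diff = (regularizer(pre_acts, softplus_d1, 1.0, step)
               - regularizer(pre_acts, softplus_d1, 1.0, -step)) / (2 * step)
```

Softplus yields a strictly positive bias gradient, while the ReLU value is exactly zero, which matches the observation that ReLU generates no bias gradient from this form of regularization.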


Theorem 2. Let $p_j^t$ denote a lower bound of $\Pr(h_j^t \le \delta_{min})$ at iteration $t$ and $s_e(.)$ be a non-decreasing function with first derivative in $[0, 1]$. If $\|W_j^t\|_2$ is upper bounded independent of $\lambda$, then $\exists S \subseteq \mathbb{R}^+$ and $\exists T_{min}, T_{max} \in \mathbb{N}$ such that $p_j^{t+1} \ge p_j^t$ $\forall \lambda \in S$, $T_{min} \le t \le T_{max}$.


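Proof. From theorem 1, $\mathbb{E}[a_j^{t+1}] \le \mathbb{E}[a_j^t]$ $\forall t \ge 0$. Define $a_{min}$ such that $\delta_{min} = \max_{a_{min}} s_e(a_{min})$. Thus $\exists T_{min} \in \mathbb{N}$ such that $\forall t \ge T_{min}$, $\mathbb{E}[a_j^t] < a_{min}$. Then in the case of non-decreasing activation functions, using Chebyshev's bound,

$$\Pr(h_j^t \le \delta_{min}) = \Pr(a_j^t \le a_{min}) \ge \Pr\left(|a_j^t - \mathbb{E}[a_j^t]| \le a_{min} - \mathbb{E}[a_j^t]\right) \ge 1 - \frac{\mathrm{Var}[a_j^t]}{\left(a_{min} - \mathbb{E}[a_j^t]\right)^2} \tag{22}$$

Thus $p_j^t := 1 - \frac{\mathrm{Var}[a_j^t]}{(a_{min} - \mathbb{E}[a_j^t])^2}$ lower bounds $\Pr(h_j^t \le \delta_{min})$ $\forall t \ge T_{min}$. Now consider the difference

$$D(t) := \frac{\mathrm{Var}[a_j^{t+1}]}{\left(a_{min} - \mathbb{E}[a_j^{t+1}]\right)^2} - \frac{\mathrm{Var}[a_j^t]}{\left(a_{min} - \mathbb{E}[a_j^t]\right)^2} \tag{23}$$

and recall that

$$\mathbb{E}_x[a_j^{t+1}] = \mathbb{E}_x[a_j^t] - \eta \frac{\partial J_{AE}}{\partial b_{e_j}} - \eta\lambda \frac{\partial \mathcal{R}}{\partial b_{e_j}} \tag{24}$$

where both the step size $\eta$ and $\frac{\partial \mathcal{R}}{\partial b_{e_j}}$ are positive and $\frac{\partial J_{AE}}{\partial b_{e_j}} \in \left[-2\sigma_r\sqrt{n}\,\|W_j\|,\ 2\sigma_r\sqrt{n}\,\|W_j\|\right]$. Thus, since $\mathrm{Var}[a_j^t] = \|W_j^t\|^2$, we can always choose a fixed $S \subseteq \mathbb{R}^+$ such that $D(t) \le 0$ $\forall \lambda \in S$ and $T_{min} \le t \le T_{max}$. □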
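
The Chebyshev lower bound used in the proof of theorem 2 can be evaluated directly. The sketch below computes $p = 1 - \mathrm{Var}[a]/(a_{min} - \mathbb{E}[a])^2$ for two successive values of the expected pre-activation at fixed variance, and compares it with an empirical probability. The Gaussian pre-activation, the function names and the numbers are assumptions made for illustration.

```python
import random

def chebyshev_lower_bound(mean_a, var_a, a_min):
    """p = 1 - Var[a] / (a_min - E[a])^2, meaningful when E[a] < a_min."""
    return 1.0 - var_a / (a_min - mean_a) ** 2

def empirical_prob(mean_a, std_a, a_min, n_samples, rng):
    """Fraction of Gaussian pre-activation samples at or below a_min."""
    hits = sum(1 for _ in range(n_samples)
               if rng.gauss(mean_a, std_a) <= a_min)
    return hits / n_samples

rng = random.Random(0)
a_min, std_a = 0.0, 1.0   # fixed variance, e.g. unit-length weights

bound_t = chebyshev_lower_bound(-3.0, std_a ** 2, a_min)    # 1 - 1/9
bound_t1 = chebyshev_lower_bound(-4.0, std_a ** 2, a_min)   # 1 - 1/16
observed = empirical_prob(-3.0, std_a, a_min, 20000, rng)
```

With the variance held fixed, lowering the expected pre-activation raises the bound, which is the situation $D(t) \le 0$ in the proof. The bound is loose for Gaussian pre-activations, so the observed probability lies well above it.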


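Theorem 3. Let $\{W, b_e\}$ represent the parameters of a DAE with squared loss, linear decoding, and i.i.d. Gaussian corruption with zero mean and $\sigma^2$ variance, at any point of training over data sampled from distribution $\mathcal{D}$. Let $a_j := W_j x + b_{e_j}$ so that $h_j = s_e(a_j)$ corresponding to sample $x \sim \mathcal{D}$. Then,

$$J_{DAE} = J_{AE} + \sigma^2\,\mathbb{E}_x\left[\sum_{j=1}^{m}\left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^4 + \sum_{\substack{j,k=1 \\ j \ne k}}^{m} \frac{\partial h_j}{\partial a_j}\frac{\partial h_k}{\partial a_k}\left(W_j^T W_k\right)^2 + \sum_{i=1}^{n}\left(b_d + W^T h - x\right)^T W^T\left(\frac{\partial^2 h}{\partial a^2} \odot W^i \odot W^i\right)\right] + o(\sigma^2) \tag{25}$$

where $\frac{\partial^2 h}{\partial a^2} \in \mathbb{R}^m$ is the element-wise 2nd derivative of $h$ w.r.t. $a$ and $\odot$ is the element-wise product.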
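Proof. Using 2nd order Taylor's expansion of the loss function, we get

$$\ell(x, f_d(f_e(\tilde{x}))) = \ell(x, f_d(f_e(\mu_x))) + (\tilde{x} - \mu_x)^T \nabla_{\tilde{x}}\ell + \frac{1}{2}(\tilde{x} - \mu_x)^T \nabla_{\tilde{x}}^2\ell\,(\tilde{x} - \mu_x) + o(\sigma^2) \tag{26}$$

where $\mu_x = x$, since we assume zero mean Gaussian noise. Thus taking the expectation of this approximation over noise yields

$$\mathbb{E}[\ell(x, f_d(f_e(\tilde{x})))] = \mathbb{E}[\ell(x, f_d(f_e(\mu_x)))] + \frac{1}{2}\mathrm{tr}\left(\Sigma_x \nabla_{\tilde{x}}^2\ell\right) + o(\sigma^2) \tag{27}$$

where $\Sigma_x := \mathbb{E}[(\tilde{x} - \mu_x)(\tilde{x} - \mu_x)^T]$. Since the corruption is i.i.d., assume the covariance $\Sigma_x = \sigma^2 I$, where $I$ is the identity matrix.

Taking expectation over $x$, we can rewrite equation (27) as

$$J_{DAE} = J_{AE} + \mathbb{E}_x\left[\frac{1}{2}\sigma^2 \sum_{i=1}^{n} \frac{\partial^2 \ell}{\partial \tilde{x}_i^2}\right] + o(\sigma^2) \tag{28}$$

Expanding the second order term in the above equation, we get

$$\frac{\partial^2 \ell}{\partial \tilde{x}_i^2} = \frac{\partial h}{\partial \tilde{x}_i}^T \frac{\partial^2 \ell}{\partial h^2} \frac{\partial h}{\partial \tilde{x}_i} + \frac{\partial \ell}{\partial h}^T \frac{\partial^2 h}{\partial \tilde{x}_i^2} \tag{29}$$

For linear decoding and squared loss,

$$\sum_{i=1}^{n} \frac{\partial \ell}{\partial h}^T \frac{\partial^2 h}{\partial \tilde{x}_i^2} = 2\sum_{i=1}^{n}\left(b_d + W^T h - x\right)^T W^T\left(\frac{\partial^2 h}{\partial a^2} \odot W^i \odot W^i\right) \tag{30}$$

where $\frac{\partial^2 h}{\partial a^2} \in \mathbb{R}^m$ is the element-wise 2nd derivative of $h$ w.r.t. $a$, $\odot$ represents element-wise product and $W^i$ denotes the $i$th column of $W$. Let vector $d_h \in \mathbb{R}^m$ be defined such that $d_{h_j} = \frac{\partial h_j}{\partial a_j}$ $\forall j \in \{1, 2, \ldots, m\}$. Then,

$$\sum_{i=1}^{n} \frac{\partial h}{\partial \tilde{x}_i}^T \frac{\partial^2 \ell}{\partial h^2} \frac{\partial h}{\partial \tilde{x}_i} = 2\sum_{j=1}^{n}\sum_{k=1}^{n}\left(\left(d_h \odot (W)^j\right)^T (W)^k\right)^2 \tag{31}$$

where $(W)^j$ represents the $j$th column of $W$ and $\odot$ denotes element-wise product. Let $D_h = \mathrm{diag}(d_h)$. Then,

$$\sum_{j=1}^{n}\sum_{k=1}^{n}\left(\left(d_h \odot (W)^j\right)^T (W)^k\right)^2 = \left\|(D_h W)^T W\right\|_F^2 \tag{32}$$

Finally, using the cyclic property of the trace operator, we get $\|(D_h W)^T W\|_F^2 = \mathrm{tr}(W^T D_h W W^T D_h W) = \mathrm{tr}(D_h W W^T D_h W W^T)$. Thus the DAE objective becomes,

$$J_{DAE} = J_{AE} + \sigma^2\,\mathbb{E}_x\left[\mathrm{tr}\left(D_h W W^T D_h W W^T\right) + \sum_{i=1}^{n}\left(b_d + W^T h - x\right)^T W^T\left(\frac{\partial^2 h}{\partial a^2} \odot W^i \odot W^i\right)\right] + o(\sigma^2) \tag{33}$$

Upon expansion of the trace term above, we get the final form. □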
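
The last step of the proof, expanding $\mathrm{tr}(D_h W W^T D_h W W^T)$ into a diagonal sum over $\|W_j\|_2^4$ and an off-diagonal sum over squared inner products of weight vectors, can be checked numerically. The sketch below uses a small made-up weight matrix whose rows are the weight vectors $W_j$, and a made-up vector `d` standing in for the derivatives $\partial h_j / \partial a_j$. The helper names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trace_term(W, d):
    """tr(D_h W W^T D_h W W^T) with D_h = diag(d)."""
    m = len(W)
    D = [[d[i] if i == j else 0.0 for j in range(m)] for i in range(m)]
    G = matmul(W, transpose(W))   # m x m Gram matrix W W^T
    DG = matmul(D, G)
    return trace(matmul(DG, DG))

def expanded_term(W, d):
    """Diagonal sum over ||W_j||^4 plus off-diagonal inner products."""
    m = len(W)
    diag = sum(d[j] ** 2 * dot(W[j], W[j]) ** 2 for j in range(m))
    off = sum(d[j] * d[k] * dot(W[j], W[k]) ** 2
              for j in range(m) for k in range(m) if j != k)
    return diag + off

W = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 1.0]]   # m = 2 hidden units, n = 3 inputs (made up)
d = [0.3, 0.9]           # stand-in for dh_j/da_j

lhs = trace_term(W, d)
rhs = expanded_term(W, d)
```

Both expressions agree for this example, reflecting the identity $\mathrm{tr}(D_h G D_h G) = \sum_{j,k} d_{h_j} d_{h_k} G_{jk}^2$ with $G = W W^T$.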
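Remark 3. Let $\{W \in \mathbb{R}^{m \times n}, b_e \in \mathbb{R}^m\}$ represent the parameters of a Marginalized De-noising Auto-Encoder (mDAE) with $s_e(.)$ activation function, linear decoding, squared loss and $\sigma_{x_i}^2 = \lambda$ $\forall i \in \{1, \ldots, n\}$, at any point of training over data sampled from some distribution $\mathcal{D}$. Let $a_j := W_j x + b_{e_j}$ so that $h_j = s_e(a_j)$ corresponding to sample $x \sim \mathcal{D}$. Then,

$$J_{mDAE} = J_{AE} + \lambda\,\mathbb{E}_x\left[\sum_{j=1}^{m}\left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^4\right] \tag{34}$$

Proof. For linear decoding and squared loss, $\frac{\partial^2 \ell}{\partial h_j^2} = 2\|W_j\|_2^2$ and $\frac{\partial h_j}{\partial \tilde{x}_i} = \frac{\partial h_j}{\partial a_j} W_{ji}$. Thus

$$\frac{1}{2}\sum_{i=1}^{n} \sigma_{x_i}^2 \sum_{j=1}^{m} \frac{\partial^2 \ell}{\partial h_j^2}\left(\frac{\partial h_j}{\partial \tilde{x}_i}\right)^2 = \lambda\sum_{i=1}^{n}\sum_{j=1}^{m} \|W_j\|_2^2\left(\frac{\partial h_j}{\partial a_j}\right)^2 W_{ji}^2 = \lambda\sum_{j=1}^{m}\left(\frac{\partial h_j}{\partial a_j}\right)^2 \|W_j\|_2^4 \tag{35}$$

□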
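
The collapse of the double sum in equation (35), which uses $\sum_i W_{ji}^2 = \|W_j\|_2^2$, can be verified with a few lines. The sketch below reuses a small made-up weight matrix and a made-up vector `d` in place of the derivatives $\partial h_j / \partial a_j$; the function names and the value of `lam` are hypothetical.

```python
def mdae_double_sum(W, d, lam):
    """lam * sum_i sum_j ||W_j||^2 * (dh_j/da_j)^2 * W_ji^2."""
    total = 0.0
    for row, dj in zip(W, d):
        sq_norm = sum(w * w for w in row)
        for w_ji in row:
            total += sq_norm * dj ** 2 * w_ji ** 2
    return lam * total

def mdae_closed_form(W, d, lam):
    """lam * sum_j (dh_j/da_j)^2 * ||W_j||^4."""
    return lam * sum(dj ** 2 * sum(w * w for w in row) ** 2
                     for row, dj in zip(W, d))

W = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 1.0]]   # rows are the weight vectors W_j (made up)
d = [0.3, 0.9]           # stand-in for dh_j/da_j
lam = 0.1

double_sum = mdae_double_sum(W, d, lam)
closed_form = mdae_closed_form(W, d, lam)
```

The two values coincide, and the closed form shows the dependence on the fourth power of the weight length that appears in equation (34).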




